Abstract

We present AMADA, an interactive web application to analyse multidimensional datasets. The user uploads a simple ASCII file and AMADA performs a number of exploratory analyses together with contemporary visualization diagnostics. The package performs a hierarchical clustering in the parameter space, and the user can choose among linear, monotonic or non-linear correlation analysis. AMADA provides a number of clustering visualization diagnostics such as heatmaps, dendrograms, chord diagrams, and graphs. In addition, AMADA has the option to run a standard or robust principal components analysis, displaying the results as polar bar plots. The code is written in R and the web interface was created using the SHINY framework. The AMADA source code is freely available at https://goo.gl/KeSPue, and the shiny-app at http://goo.gl/UTnU7I.
1. Introduction

The emerging precision era of astronomy marks the transition from a data-deprived field to a data-driven science, in which statistical methods play a central role. The need to handle these ever-increasing datasets impacts all branches of modern science, characterizing the so-called era of Big Data. As a consequence, an efficient exploration of high-dimensional datasets is becoming ubiquitous throughout all scientific fields, such as biology (e.g., Venter et al. 2004), social sciences (e.g., Patty and Penn 2015), geology (e.g., van Zyl 2014) and astronomy (e.g., Ball and Brunner 2010; Graham et al. 2013; Martinez-Gomez et al. 2013).

Upcoming surveys such as the Large Synoptic Survey Telescope (e.g., LSST Science Collaboration et al. 2009), the Square Kilometre Array (e.g., Carilli 2014), and Euclid (e.g., Scaramella et al. 2015), just to mention a few, will push the boundaries of our ability to analyse sky catalogs, while the ever-increasing complexity of cosmological simulations keeps lessening the distance between observed and synthetic data (e.g., Overzier et al. 2013; de Souza et al. 2013b, 2014b; Vogelsberger et al. 2014).

Email address: rafael@caesar.elte.hu (R. S. de Souza)


An optimal exploration of these catalogs, observed and/or simulated, heavily relies on our ability to uncover hidden relationships among different quantities (e.g., Borne et al. 2008; Ball and Brunner 2010; Graham et al. 2013), such as fundamental planes of galaxy properties (Tully and Fisher 1977; Faber and Jackson 1976), as well as to identify the optimal set of variables to describe and predict a certain property of interest (e.g. the presence of star formation activity in a halo; de Souza et al. 2015).

A mainstay methodology for data exploration in astronomy is the correlation analysis. Its goal is to describe the level of association, usually linear, between a given pair of variables. Its applicability covers virtually the entire astronomical domain, such as gamma-ray bursts (e.g., Burgess et al. 2014), cosmic voids (Hamaus et al. 2014), star formation activity (Fee et al. 2013), dark matter halo properties (de Souza et al. 2013a, 2014a), and baryonic galaxy properties (Yates et al. 2012), just to cite a few.


To facilitate the use of contemporary exploratory and visualization techniques commonly used in other scientific fields but not fully exploited in astronomy, we developed the AMADA package. The code allows the user to visualize subgroups of variables with high association in a hierarchical tree structure through diverse visual tools, such as graphs, chord diagrams, dendrograms and heatmaps. The goal is to deliver a user-friendly guide for a first data screening. By providing a systematic methodology for clustering detection in the space of object properties, the researcher can make a statistically justified decision about the subset of features to be studied in a given catalog.

Preprint submitted to Astronomy and Computing, June 19, 2015

It is worth noting that other interfaces for data exploration in astronomy exist (e.g., Brescia et al. 2010; Burger et al. 2013; Konstantopoulos 2015). In particular, VOStat (Chakraborty et al. 2013) and AstroStat (Kembhavi et al. 2015) are two web-based services for statistical analysis using R under the hood. Both projects are focused on providing a user-friendly environment to perform a wide range of standard statistical analyses, such as hypothesis testing, multivariate analysis, clustering and so forth. However, AMADA is the first of its kind with a primary focus on information visualization techniques for general correlation analysis in multidimensional catalogs.


2. Main features 

AMADA is written in R 3.1.1 and developed using the RStudio and Shiny frameworks. RStudio is an open source interface for the development of R applications, and Shiny is a package for building interactive web applications directly from R. Instructions on how to run the code locally, and a brief installation tutorial, are given in Appendix A.


The package allows an interactive exploration and 
information retrieval from high-dimensional datasets. 
The user can choose among different methods for 
correlation analysis, whose outcomes are displayed 
in a chosen graphical layout for visual inspection. In 
the following, we briefly describe the main available 
features. 


RStudio: www.rstudio.com
Shiny: shiny.rstudio.com


2.1. Datasets 

The user can upload a dataset in a plain text ASCII file as space or comma separated values (CSV). The columns should be named, and missing data should be marked as NA. An example of what a typical dataset looks like, together with a screenshot from the web portal, is displayed in Fig. 1. Alternatively, the user can use the download data button to inspect in their own text editor how to format the matrix. The current version of AMADA does not allow an interactive selection of columns. Therefore, we show below how it can be easily done in the R command line using the c function:


    data(iris)
    # shorten the column names
    colnames(iris) <- c("SL", "SW", "PL", "PW", "Species")
    head(iris)

     SL  SW  PL  PW Species
    5.1 3.5 1.4 0.2  setosa
    4.9 3.0 1.4 0.2  setosa

    # keep only the columns of interest
    iris2 <- iris[, c("SL", "SW")]
    head(iris2)

     SL  SW
    5.1 3.5
    4.9 3.0


The original column names of the famous iris dataset
(Fisher 1936) are shortened in the example (S = sepal,
P = petal, L = length, W = width) to save space. 
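
As a minimal sketch of the input format described above (named columns, comma-separated values, missing data marked as NA), one could write and re-read a small table in base R. The file name and the artificially inserted missing value are illustrative assumptions, not part of AMADA.

```r
# Build a small table with named columns from the iris example.
data(iris)
colnames(iris) <- c("SL", "SW", "PL", "PW", "Species")
catalog <- iris[, c("SL", "SW", "PL", "PW")]  # keep the numeric columns only
catalog[3, "SW"] <- NA                        # pretend one measurement is missing

# Write a comma-separated ASCII file; missing entries are written as "NA".
path <- file.path(tempdir(), "example_catalog.csv")  # hypothetical file name
write.csv(catalog, path, row.names = FALSE, na = "NA")

# Read the file back to check that the header and the NA survive.
check <- read.csv(path, na.strings = "NA")
print(head(check, 3))
```

A file produced this way has one header line followed by one object per row, which is the layout the text above asks for.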

In addition, some public catalogs are already made available on the portal. In the following we will use two of them for explanatory purposes. As an example of a low-dimensional and relatively small sample we use a catalog of galaxies experiencing supernova (SN) explosions, while as an example of a high-dimensional and moderately large sample we use a mock galaxy catalog. More specifically, we apply AMADA to investigate:


• Supernova host galaxy properties (Sako et al. 2014). In this catalog the properties of Type Ia and II supernova host galaxies are retrieved from the Sloan Digital Sky Survey multi-band photometry. The available catalog represents a sub-sample of the original one, after removal of non-supernova objects and missing data. The final sample is composed of 443 (56) galaxies hosting Type Ia (Type II) supernovae, each of them described by 10 parameters, such as galaxy age, star formation rate, distance from supernova to the host galaxy, and so forth.

• Galaxy properties (Guo et al. 2011). A mock galaxy catalog built using semi-analytic galaxy formation models and the N-body Millennium Simulations (Springel 2005). The initial dataset is composed of ≈ 180,000 haloes at redshift 0. To avoid numerical artifacts due to low resolution effects, we select only those structures with at least 300 particles (e.g., Antonuccio-Delogu et al. 2010). In addition, we consider only central star forming galaxies (i.e., no satellite galaxies). The remaining dataset is composed of 7079 haloes, and each halo is described by approximately 30 parameters.

Figure 1: A screenshot of the AMADA portal showing properties of host galaxies of Type Ia supernovae. This portal is publicly available at http://goo.gl/UTnU7I


Since we adopt here the original nomenclature for the various quantities, we refer the reader to the original articles or catalogs for a detailed description of each parameter.


2.2. Control Options

Several control options are available on the portal to choose among different methods of analysis and visualization. Once the desired combination is chosen, the user should click on the button Make it so! to update the results. The following options are available:

• Fraction of data to display: choose the percentage of data displayed on the screen.

• Correlation method: choose among Pearson, Spearman or Maximal Information Coefficient (MIC).

• Display numbers: choose whether correlation coefficients should be displayed in the heatmap.

• Dendrogram type: choose among phylogram, cladogram or fan configuration (visualizations inspired by phylogenetic tools; e.g., Paradis et al. 2004).

• Graph layout: choose between spring and circular configurations.

• Chord diagram colour: choose among different colour schemes.

• Number of PCs: choose the number of Principal Components (PCs) to display as Nightingale charts.

• PCA method: choose between standard or robust Principal Components Analysis (PCA).


3. Methods

In this section we briefly discuss the different methods used by AMADA to analyse the datasets.

3.1. Correlation methods

The correlation analysis quantifies the strength of the association between a pair of variables through a correlation coefficient. Its absolute value varies between 0 (uncorrelated variables) and 1 (perfect association). Currently, AMADA offers three options of correlation measurements: linear (Pearson; Pearson 1895), monotonic (Spearman; Spearman 1904) and non-linear (MIC; Reshef et al. 2011). We briefly present them in the following, and refer the reader to the original papers for more details.

Pearson. This is widely employed in statistics to measure the degree of the relationship between linearly related variables. The following formula is used to estimate the Pearson coefficient, $r_p$, between two variables $X_i$ and $Y_i$:

$$r_p = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}, \tag{1}$$

where $\bar{X}$ and $\bar{Y}$ represent the sample means, and $n$ the total number of objects in the dataset.

Spearman rank correlation. This is a non-parametric method to measure the degree of monotonic association between two variables, and does not rely on any distributional assumption. For a dataset of size $n$, the variables $X_i$ and $Y_i$ are converted to ranks and the following formula is used to calculate the Spearman coefficient, $\rho$:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \tag{2}$$

where $d_i$ is the difference between the ranks of $X_i$ and $Y_i$.


Note on ranking: in statistics, ranking refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, if the numerical data 3.8, 5.4, 2.1, 10.3 are observed, the ranks of these data items would be 2, 3, 1 and 4, respectively.

Maximal information coefficient. MIC (Reshef et al. 2011) is founded on concepts of information theory (e.g., O 1990). In this context, the Shannon entropy, $\mathcal{H}$, can be understood as a measure of uncertainty of a random variable. For a single discrete distribution it can be written as

$$\mathcal{H}(A) = -\sum_{a\in A} p(a)\log p(a), \tag{3}$$

while the joint entropy for a pair of discrete random variables $(A,B)$ with a joint distribution $p(a,b)$ is defined as

$$\mathcal{H}(A,B) = -\sum_{a\in A}\sum_{b\in B} p(a,b)\log p(a,b), \tag{4}$$

where $p(a)$ and $p(b)$ are the marginal probability mass functions (PMFs) of $A$ and $B$, and $p(a,b)$ is the joint PMF. The mutual information (MI) then measures the amount of information that one random variable contains about another random variable:

$$\mathrm{MI}(A,B) = \sum_{a\in A}\sum_{b\in B} p(a,b)\log\left(\frac{p(a,b)}{p(a)\,p(b)}\right) = \mathcal{H}(A)+\mathcal{H}(B)-\mathcal{H}(A,B). \tag{5}$$

Consider $D$ as a finite set of ordered pairs, $\{(a_i,b_i),\ i=1,\dots,n\}$, partitioned into an $x$-by-$y$ grid of variable size, $G$, such that there are $x$ bins spanning $a$ and $y$ bins covering $b$. The PMF of a particular grid cell is proportional to the number of data points inside that cell. We can define a characteristic matrix $M(D)$ of a set $D$ as

$$M(D)_{x,y} = \frac{\max(\mathrm{MI})}{\log\min\{x,y\}}, \tag{6}$$

where the maximum is taken over all $x$-by-$y$ grids $G$, representing the highest normalized MI of $D$. The MIC of a set $D$ is then defined as

$$\mathrm{MIC}(D) = \max_{0<xy<B(n)}\left\{M(D)_{x,y}\right\}, \tag{7}$$

representing the maximum value of $M$ subject to $0 < xy < B(n)$, where the function $B(n) = n^{0.6}$ was empirically determined by Reshef et al. (2011).
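
The definitions in equations (1) to (6) can be checked numerically with a few lines of base R. This is only a sketch on simulated data: the toy relation between x and y and the helper `grid_mi` are illustrative assumptions. The mutual information is evaluated on a single equal-width 4-by-4 grid, so it is not the maximum over grids required by equation (6), and the full MIC search of equation (7) is not attempted.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- exp(x) + rnorm(n, sd = 0.1)   # monotonic but non-linear toy relation

# Pearson coefficient, equation (1)
r_p <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Spearman coefficient, equation (2); exact when there are no ties
d <- rank(x) - rank(y)
rho <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Mutual information on one fixed grid, equations (3)-(5).
# grid_mi is a hypothetical helper, not an AMADA function.
grid_mi <- function(a, b, nx, ny) {
  counts <- table(cut(a, nx), cut(b, ny))   # points per grid cell
  p_ab <- counts / sum(counts)              # joint PMF
  p_a <- rowSums(p_ab)                      # marginal PMF of a
  p_b <- colSums(p_ab)                      # marginal PMF of b
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  entropy(p_a) + entropy(p_b) - entropy(p_ab)
}
mi <- grid_mi(x, y, 4, 4)
mi_norm <- mi / log(min(4, 4))   # normalization used in equation (6)

print(c(pearson = r_p, spearman = rho, mi_normalized = mi_norm))
```

The hand-written coefficients should agree with `cor(x, y)` and `cor(x, y, method = "spearman")`, which is a convenient sanity check before moving to real catalogs.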


3.2. Principal Components Analysis

The ultimate goal of PCA is to reduce the dimensionality of a multivariate dataset, while explaining the data variance with as few PCs as possible. Given its versatility, it has been applied to a broad range of astronomical studies, such as stellar, galaxy and quasar spectra (e.g., Chen et al. 2009; McGurk et al. 2010), galaxy properties (Conselice 2006; Scarlata et al. 2007), Hubble parameter and cosmic star formation reconstruction (e.g., Ishida et al. 2011; Ishida and de Souza 2011), and supernova photometric classification (Ishida and de Souza 2013).

PCA belongs to a class of Projection-Pursuit (PP; e.g., Croux et al. 2007) methods, whose aim is to detect structures in multidimensional data by projecting them onto a lower dimensional subspace (LDS). The LDS is selected by maximizing a projection index (PI), where PI represents a given feature in the data (trends, clusters, hyper-surfaces, anomalies, etc.). The particular case where variance ($S^2$) is taken as a PI leads to the classical version of PCA. In this case the PCs are computed by diagonalization of the data covariance matrix, with the resulting eigenvectors corresponding to PCs and the resulting eigenvalues to the variance explained by the PCs. The eigenvector corresponding to the largest eigenvalue gives the direction of greatest variance (PC1), the second largest eigenvalue gives the direction of the next highest variance (PC2), and so on. Since covariance matrices are symmetric positive semidefinite, the eigenbasis is orthonormal (spectral theorem).

The PCA scheme employed here falls into the category of filter methods of feature selection. Their aim is to determine how relevant a feature is in representing a class in a high-dimensional space, but there exist other approaches, i.e. the wrapper methods, that can be tailored to determine how relevant a feature is against a given classification task (see e.g., Donalek et al. 2013 for a discussion of feature selection methods in astronomy).

Given $n$ parameters $x_1, \cdots, x_n$, all of them column vectors of dimension $T$, the first PC is obtained by finding a unit vector $a$ which maximizes the variance of the data projected onto it:

$$a_1 = \underset{\|a\|=1}{\arg\max}\; S^2(a^t x_1, \cdots, a^t x_n), \tag{8}$$

where $t$ is the transpose operation and $a_1$ is the direction of the first PC. Once we have computed the $(k-1)$th PC, the direction of the $k$th component, for $1 < k \leq T$, is given by

$$a_k = \underset{\|a\|=1,\ a\perp a_1,\cdots,a\perp a_{k-1}}{\arg\max}\; S^2(a^t x_1, \cdots, a^t x_n), \tag{9}$$

where the condition of each PC to be orthogonal to all previous ones ensures a new uncorrelated basis. Despite these attractive properties, the classical version of PCA has some critical drawbacks, such as its sensitivity to outliers (e.g., Hampel et al. 2005). In order to overcome this limitation, several robust versions were created. For instance, instead of taking the variance as a PI in equation (8), a robust measure of variance (Hoaglin et al. 2000) is taken, based on the median absolute deviation (MAD; e.g., Howell 2005), which for an ordered set $x$ is given by

$$\mathrm{MAD}(x_1, \cdots, x_n) = 1.48\, \underset{j}{\mathrm{med}}\,\big|x_j - \underset{i}{\mathrm{med}}(x_i)\big|, \tag{10}$$

where med represents the median of the sample, and the square of MAD gives the robust variance. The value of 1.48 represents $1/Q_{0.75}$, where $Q_{0.75}$ is the 0.75 quantile of a standard normal distribution. AMADA allows the user to run a robust PCA based on the grid search algorithm from Croux et al. (2007).

3.3. Hierarchical Clustering

A cluster analysis can be understood as a descriptive statistic to determine if a given dataset should be divided into different groups. The method aims to identify which groups of objects are similar to each other but different (or distant) from objects in other groups. There are several ways to define dissimilarity (or distance), according to each particular goal. Since we are interested in finding groups of highly correlated variables, it is natural to define the dissimilarity, $\mathcal{D}$, between properties as

$$\mathcal{D}(X_i, Y_i) = 1 - |\mathrm{Corr}(X_i, Y_i)|, \tag{11}$$

where Corr stands for the correlation measurement. Thus, $\mathcal{D}(X_i, Y_i) = 0$ represents perfect correlation, while the value of $\mathcal{D}(X_i, Y_i) = 1$ indicates uncorrelated variables.

Note on notation: arg max f(x), as used in the PCA equations, is the set of values of x for which the function f(x) attains its largest value.

One of the main advantages of hierarchical clustering methods is that a prior specification of the number of clusters to be searched is not needed. Instead, the method requires a measurement of dissimilarity between groups of variables, which is based on the pairwise dissimilarities among the observations within each of two groups. We employ an agglomerative approach, where each variable is initially assigned to its own cluster, then the method recursively merges a selected pair of clusters into a single one, where each new pair is composed by merging the two groups with the smallest $\mathcal{D}$ in the immediately lower level of the hierarchy. The lowest level represents each single variable, while the highest level is a single cluster containing all variables. The final outcome is a hierarchical representation in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. To guide the user in the task of selecting a certain sub-group of interest, we provide an optimal number of clusters estimated via the Caliński and Harabasz index (Caliński and Harabasz 1974). The tree-like final structure can be graphically portrayed by, e.g., dendrograms, graphs and chord diagrams, as discussed in the following.
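
The agglomerative scheme with the dissimilarity of equation (11) can be sketched in base R as follows. The simulated variables, their names, the "average" linkage and the fixed number of clusters are all assumptions made for illustration; in particular, the Caliński and Harabasz index is not computed here.

```r
set.seed(3)
n <- 500
base1 <- rnorm(n)
base2 <- rnorm(n)
# Two groups of strongly associated toy variables (hypothetical names).
X <- cbind(u   = base1 + 0.1 * rnorm(n),
           g   = base1 + 0.1 * rnorm(n),
           r   = base1 + 0.1 * rnorm(n),
           age = base2 + 0.1 * rnorm(n),
           sfr = -base2 + 0.1 * rnorm(n))

D <- 1 - abs(cor(X))                           # dissimilarity of equation (11)
hc <- hclust(as.dist(D), method = "average")   # linkage choice is an assumption
groups <- cutree(hc, k = 2)                    # k fixed by hand for this toy case

print(groups)
```

Because the absolute value of the correlation is used, the anti-correlated pair (age, sfr) ends up in the same cluster, just like the positively correlated magnitudes.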


4. Visualization tools 

When dealing with a large amount of complex information, visualizing it in an intelligible way becomes a challenge. In this case, the aim of a visualization method is to optimize the intuitive insight of the data structure in order to exploit the perceptual capabilities of the human eye. Whilst the role of visualization belongs to the groundwork of astronomical analysis, new paradigms for multidimensional data visualization are not fully exploited, when compared to other fields. Patterns, trends and correlations that might go undetected in tabular-based data can be revealed and more easily communicated with interactive visualization tools. AMADA incorporates contemporary methods to visualize multidimensional data properties and their intrinsic correlations. This is particularly relevant if one aims to have a physical intuition of possible sub-populations of highly correlated quantities, which are not necessarily the dominant components of the whole sample. In the following, we describe the main visual capabilities of the package with a brief introduction of each methodology.


4.1. Heatmap 

The cluster heatmap is a rectangular grid representation of a matrix with cluster trees appended to its margins. Its aim is to facilitate inspection of cluster structures in large matrices within a compact displayed area. The method is broadly used in the biological sciences (Wilkinson and Friendly 2009), and it is worth citing its recent application to solar data mining (Fig. 10 of Schuh et al. 2015).

In the case of a correlation matrix, the color assigned to a point in the heatmap grid indicates how much each pair of variables correlates, as can be seen in the typical heatmap shown in Fig. 2. For visualization purposes, the arrangement of the rows and columns is made following a hierarchical clustering with a dendrogram drawn at the edges of the matrix. The figure portrays the heatmap of the mock galaxy catalog from Guo et al. (2011). Note the red square in the bottom right corner of the panel, automatically highlighting the trivial association between the magnitudes in the u, g, r, i, and z bands. Less trivial associations can be identified more easily using, for instance, a dendrogram visualization, as discussed in the following section.
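
A cluster heatmap of a correlation matrix can be sketched with the base R function `heatmap`. The toy data, the colour ramp and the use of 1 - |Corr| as the ordering distance are assumptions chosen to mimic the description above; this is not the plotting code used by AMADA.

```r
set.seed(4)
n <- 300
z <- rnorm(n)
# Three correlated toy "magnitudes" plus two unrelated columns (hypothetical names).
X <- cbind(u = z + 0.2 * rnorm(n), g = z + 0.2 * rnorm(n),
           r = z + 0.2 * rnorm(n), mass = rnorm(n), age = rnorm(n))
C <- cor(X)

pdf(NULL)   # null device, so the sketch also runs non-interactively
hm <- heatmap(C, symm = TRUE, scale = "none", zlim = c(-1, 1),
              distfun = function(m) as.dist(1 - abs(m)),   # treat input as correlations
              col = colorRampPalette(c("blue", "yellow", "red"))(64))
dev.off()

# Column order chosen by the hierarchical clustering.
ordered_names <- colnames(C)[hm$colInd]
print(ordered_names)
```

The correlated columns are placed next to each other by the reordering, which is what produces the coloured square discussed in the text.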


4.2. Dendrogram 

A dendrogram provides a comprehensive description of the hierarchical structures in a visual format. Among the applications in astronomical research are the hierarchical structural analysis of interstellar properties (Houlahan and Scalo 1992), molecular clouds (Rosolowsky et al. 2008), and explanatory classification of galaxies (Fraix-Burnet et al. 2012). The individual variables are arranged along the bottom of the dendrogram and referred to as leaf nodes. Clusters are formed by joining individual variables or existing clusters, with the joint point referred to as a node. At each dendrogram node we have a right and left sub-branch of clustered variables. The height of the node can be understood as the dissimilarity $\mathcal{D}$ between the right and left sub-branch clusters.

Figure 2: Heatmap visualization of the correlation matrix (using a Pearson correlation measure) of some galaxy properties from the mock galaxy catalog by Guo et al. (2011). Red indicates strong positive correlation and blue indicates strong negative correlation. Yellows are associated with correlations close to zero.

Figure 3: Dendrogram of the galaxy properties from the Guo et al. (2011) catalog. The different sub-groups of galaxy properties, assigned using the Caliński and Harabasz index, are colored according to the cluster assignment.

Fig. 3 displays a dendrogram of the galaxy properties from the Guo et al. (2011) catalog, divided into 10 major clusters (indicated by different colors) using the Caliński and Harabasz index. The method automatically suggests interesting associations among the galaxy properties, such as the u-band as an indicator of the star formation rate (SFR; see e.g. Gilbank et al. 2010).
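
The interpretation of node heights as dissimilarities can be verified on a toy example in base R. The variable names are hypothetical and merely echo the SFR and u-band association mentioned above; the "average" linkage is again an assumption.

```r
set.seed(5)
n <- 400
s <- rnorm(n)
# One anti-correlated pair plus two unrelated columns (hypothetical names).
X <- cbind(sfr = s + 0.3 * rnorm(n), u_mag = -s + 0.3 * rnorm(n),
           mass = rnorm(n), spin = rnorm(n))

D <- 1 - abs(cor(X))
hc <- hclust(as.dist(D), method = "average")
dend <- as.dendrogram(hc)

# The first merge joins two leaves; its height is their dissimilarity.
first_merge <- hc$labels[abs(hc$merge[1, ])]
first_height <- hc$height[1]

pdf(NULL)   # null device, so the sketch also runs non-interactively
plot(dend, ylab = "dissimilarity 1 - |Corr|")
dev.off()

print(first_merge)
print(first_height)
```

Reading the tree from the bottom up, the lowest node identifies the most strongly associated pair of variables.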


4.3. Graphs 

Graphs are powerful tools to represent multivariate data and their relationships. Examples of scientific applications are the analysis of cellular networks (Aittokallio and Schwikowski 2006), protein interactions (e.g., Fig. 1 from Aragues et al. 2006), and brain disorders (Fig. 2 from Fornito et al. 2015). A graph is defined by a set of vertices representing the objects of study, and a set of edges representing the relationships between them. There are many criteria for judging an optimally drawn graph, such as:


• edge crossings should be minimized; 

• the vertices should be evenly distributed in the 
plane; 

• the graph should reflect intrinsic symmetries; 

• the edges should not cross nodes. 


Each item above can be understood as an optimization problem, which is the subject of interest of a research field known as graph drawing (e.g., Tamassia 2007). There are several methods for graph representations. In this work we use the so-called spring-embedder algorithm (Eades 1984; Fruchterman and Reingold 1991). The underlying idea is to allow the vertices to behave like particles moving under the influence of repulsive and attractive forces until the system reaches equilibrium. This graph-drawing algorithm is particularly useful for graphs where the directions of the edges are not important, which is the case of a correlation matrix representation. Fig. 4 displays the correlations among properties of galaxies hosting Type Ia (left) and Type II (right) supernovae. Each vertex represents a galaxy property, while the thickness of the edges is weighted by the degree of correlation between each pair of variables (Epskamp et al. 2012). More specifically, the width and color of the edges correspond to the absolute value of the correlations: the higher the correlation, the thicker and more saturated the edge is. Highly correlated parameters appear closer in the graph.
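
The spring-embedder idea can be illustrated with a deliberately simplified force-directed layout written in base R. This is a toy variant in the spirit of Fruchterman and Reingold, not the implementation used by AMADA: the function `spring_layout`, its cooling schedule and the toy data are all assumptions made for the example.

```r
# Toy force-directed layout: W is a symmetric matrix of non-negative edge weights.
spring_layout <- function(W, n_iter = 200, seed = 1) {
  set.seed(seed)
  n <- nrow(W)
  xy <- matrix(runif(2 * n, -1, 1), n, 2)   # random initial positions
  k <- 1 / sqrt(n)                          # ideal edge length
  temp <- 0.1                               # maximum step, cooled at each iteration
  for (it in seq_len(n_iter)) {
    disp <- matrix(0, n, 2)
    for (i in seq_len(n)) {
      delta <- matrix(xy[i, ], n, 2, byrow = TRUE) - xy   # vectors from j to i
      dist <- pmax(sqrt(rowSums(delta^2)), 1e-9)
      repulse <- k^2 / dist                 # every pair of vertices repels
      attract <- W[i, ] * dist^2 / k        # weighted edges pull vertices together
      force <- (repulse - attract) / dist   # net force per unit of delta
      force[i] <- 0                         # no self-interaction
      disp[i, ] <- colSums(delta * force)
    }
    step <- sqrt(rowSums(disp^2))
    xy <- xy + disp * (pmin(step, temp) / pmax(step, 1e-9))  # cap the step length
    temp <- temp * 0.98
  }
  xy
}

set.seed(7)
n <- 300
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- cbind(a1 = z1 + 0.3 * rnorm(n), a2 = z1 + 0.3 * rnorm(n), a3 = z1 + 0.3 * rnorm(n),
           b1 = z2 + 0.3 * rnorm(n), b2 = z2 + 0.3 * rnorm(n), b3 = z2 + 0.3 * rnorm(n))
W <- abs(cor(X))
diag(W) <- 0
xy <- spring_layout(W)

pdf(NULL)   # null device, so the sketch also runs non-interactively
plot(xy, pch = 19, axes = FALSE, xlab = "", ylab = "")
for (i in 1:(ncol(X) - 1)) for (j in (i + 1):ncol(X)) {
  segments(xy[i, 1], xy[i, 2], xy[j, 1], xy[j, 2], lwd = 0.5 + 4 * W[i, j])
}
text(xy[, 1], xy[, 2], labels = colnames(X), pos = 3)
dev.off()
```

In such a layout one would expect strongly correlated variables to settle closer together, since their attractive term is larger, while the edge width encodes the absolute correlation as described in the text.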

4.4. Chord diagram 

The chord diagram is a flexible and popular tool that has been used in many different applications, such as the identification of relevant signatures in cancer genomes (Fig. 1 from Bunting and Nussenzweig 2013), or the study of the relation between foragers and farmers in Central Europe during the Stone Age (Fig. S5 from Bollongino et al. 2013).

In the case studied here, the chord diagram represents another visualization of the correlation matrix, like the graph, heatmap and dendrogram. This tool illustrates relationships between distinct parameters. The columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments (Gu et al. 2014). The thickness of the ribbons is weighted by the degree of correlation between each pair of variables. Fig. 5 portrays the correlations among supernova Type Ia/II host galaxy properties. For a given choice of colour palette, the colour intensity ranges from fully anti-correlated to correlated values.

Figure 4: Graph representation of the host galaxy properties from Sako et al. (2014). The thickness of the edges is weighted by the degree of correlation between each pair of variables. The width and color correspond to the degree of association: the higher the correlation, the thicker and more color saturated the edge is. The left (right) side represents the properties of Type Ia (Type II) supernova host galaxies.

4.5. Nightingale chart 

The last plot is inspired by the original Nightingale chart. This is one of the most influential statistical visualizations of all time, used by Florence Nightingale to convince Queen Victoria about improving hygiene in military hospitals (see also Draper et al. 2009 for a review of radial methods in information visualization).

We show it as a polar bar plot, where the length of each slice represents the relative contribution of each variable to the i-th Principal Component. Fig. 6 displays the contributions of the supernova Type Ia/II host galaxy properties for the first and second principal components.
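
A polar bar plot of this kind can be sketched in base R. How AMADA defines the relative contribution is not stated in the text, so the squared loadings of the first PC are used here as an assumption; the toy data and variable names are likewise illustrative.

```r
set.seed(6)
X <- matrix(rnorm(200 * 5), 200, 5)
X[, 2] <- X[, 1] + 0.2 * X[, 2]                    # make two columns correlated
colnames(X) <- c("age", "mass", "sfr", "u", "g")   # hypothetical labels

fit <- prcomp(X, scale. = TRUE)
contrib <- fit$rotation[, 1]^2   # squared loadings of PC1 (assumed definition); sum to one

# One wedge per variable, with radius proportional to its contribution.
p <- length(contrib)
edges <- seq(0, 2 * pi, length.out = p + 1)
pdf(NULL)   # null device, so the sketch also runs non-interactively
plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
     axes = FALSE, xlab = "", ylab = "")
for (i in seq_len(p)) {
  theta <- seq(edges[i], edges[i + 1], length.out = 30)
  r <- contrib[i] / max(contrib)
  polygon(c(0, r * cos(theta)), c(0, r * sin(theta)), col = i + 1)
}
dev.off()

print(round(contrib, 3))
```

Squared loadings are a convenient choice because each PC direction is a unit vector, so the contributions of all variables add up to one.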


Note: we should warn the reader that currently the SHINY interface does not work well with more than 4 PCs simultaneously displayed on the screen. This limitation can potentially be fixed by tweaking the figure dimensions, if e.g. a PDF file is produced using the R command line (see Appendix A).

Figure 5: A chord diagram representing the Pearson correlations among the properties of galaxies hosting Type Ia (left panel) and Type II supernovae (right panel).

5. Summary

We have presented the AMADA package, a web application for interactive exploration and information retrieval of high-dimensional datasets. It is designed for high-dimensional catalogs, with a wide range of applications. There are, though, some limitations in terms of data size and performance. In particular, SHINY allows uploads of only up to 1 GB of data. Thus, the SHINY server should be mostly used for a quick exploration of the package features, so that the user can skip the installation step to become familiar with the code, while we recommend running AMADA locally (as explained in Appendix A) when it is applied to a real scientific problem. In addition, the speed performance of some methods, such as the hierarchical clustering, may not scale well with very large datasets. As a reference, the processing time to produce a dendrogram from a matrix with 100,000 objects and 100 columns was ~ 1.5 seconds on an iMac featuring a 3.5 GHz Intel Core i7 and 32 GB of RAM. An example of the script to reproduce this test is given below:

    require(AMADA)
    N <- 100000   # Number of rows
    M <- 100      # Number of columns
    # simulated matrix of independent standard normal variables
    M1 <- matrix(rnorm(N * M, mean = 0, sd = 1), N, M)
    ptm <- proc.time()                    # start the timer
    corr <- Corr_MIC(M1, "pearson")       # correlation matrix
    Fig1 <- plotdendrogram(corr, "fan")   # dendrogram in fan layout
    proc.time() - ptm                     # elapsed time

Therefore, despite some limitations, we expect the current version of the package to be suitable for a wide variety of astronomical catalogs. Since this is a software release paper, we avoided a detailed scientific discussion on the available datasets, which here have been used merely as a proof of concept. However, it is worth mentioning that AMADA automatically recovers and displays trivial and non-trivial correlations. An example of the former is the correlation between the u, g, r, i and z magnitudes of supernova host galaxies, as seen in the visualizations of that catalog, while an example of the latter is the association between the star formation rate and u-band magnitude in the mock galaxy catalog, as shown in Fig. 3. It is important to mention that a few of the methods implemented here are a later development of previous work by the authors, which made use of MIC statistics and robust PCA to understand the redshift dependence of halo baryonic properties in the early Universe (de Souza et al. 2014). We therefore refer the reader to this work as an example of application in a cosmological context of the methods discussed here.

The code is freely available on github and can be run both online and locally. This work is part of a larger enterprise known as the Cosmostatistics Initiative (COIN), whose philosophy is to enable astronomers to easily introduce novel techniques into their daily research. This is an open-source project, and we expect to continuously add extra features. Therefore, we encourage the users to contact the authors with suggestions, while potential contributors and developers can fork the AMADA repository on github.
Figure 6: A Nightingale diagram representing the contributions of the properties of galaxies hosting Type Ia (left panel) and Type II (right panel) supernovae.


M. L. Dantas and T. Kitching for testing AMADA on 
their respeetive maehines. We thank the eonstruetive 
suggestions of the referee. The lAA Cosmostatis- 
ties Initiative (C0IN|^ is a non-profit organization 
whose aim is to nourish the synergy between astro- 
physies, eosmology, statisties and maehine learning 
eommunities. 

References 

Aittokallio, T., Schwikowski, B., 2006. Graph-based methods
for analysing networks in cell biology. Briefings in
Bioinformatics 7 (3), 243-255.
URL http://bib.oxfordjournals.org/content/7/3/243.abstract

Antonuccio-Delogu, V., Dobrotka, A., Becciani, U., Cielo, S.,
Giocoli, C., Maccio, A. V., Romeo-Velona, A., Sep. 2010.
Dissecting the spin distribution of dark matter haloes.
MNRAS 407, 1338-1346.

Aragues, R., Jaeggi, D., Oliva, B., 2006. Piana: protein 
interactions and network analysis. Bioinformatics 22 (8), 
1015-1017. 

URL http://bioinformatics.oxfordjournals.org/content/22/8/1015.abstract

Ball, N. M., Brunner, R. J., 2010. Data Mining and Machine 
Learning in Astronomy. International Journal of Modern 
Physics D 19, 1049-1106. 

Bollongino, R., Nehlich, O., Richards, M. P., Orschiedt, J.,
Thomas, M. G., Sell, C., Fajkosova, Z., Powell, A., Burger,
J., 2013. 2000 years of parallel societies in Stone Age
Central Europe. Science 342 (6157), 479-481.
URL http://www.sciencemag.org/content/342/6157/479.abstract

Borne, K., Becla, J., Davidson, I., Szalay, A., Tyson, J. A., Dec.
2008. The LSST Data Mining Research Agenda. In: Bailer-Jones,
C. A. L. (Ed.), American Institute of Physics Conference
Series. Vol. 1082 of American Institute of Physics
Conference Series, pp. 347-351.

Brescia, M., Longo, G., Djorgovski, G. S., Cavuoti, S.,
D’Abrusco, R., Donalek, C., Di Guido, A., Fiore, M.,
Garofalo, M., Laurino, O., Mahabal, A., Manna, F., Nocella,
A., d’Angelo, G., Paolillo, M., Oct. 2010. DAME: A Web
Oriented Infrastructure for Scientific Data Mining &
Exploration. ArXiv e-prints.

Bunting, S. F., Nussenzweig, A., Jul. 2013. End-joining,
translocations and cancer. Nat Rev Cancer 13 (7), 443-454.
URL http://dx.doi.org/10.1038/nrc3537

Burger, D., Stassun, K. G., Pepper, J., Siverd, R. J., Paegert, M.,
De Lee, N. M., Robinson, W. H., Aug. 2013. Filtergraph: An
interactive web application for visualization of astronomy
datasets. Astronomy and Computing 2, 40-45.


https://asaip.psu.edu/organizations/iaa/iaa-working-group-of-cosmostatistics


Burgess, J. M., Preece, R. D., Ryde, F., Veres, P., Meszaros,
P., Connaughton, V., Briggs, M., Pe’er, A., Iyyani, S., et al.,
Apr. 2014. An Observed Correlation between Thermal and
Non-thermal Emission in Gamma-Ray Bursts. ApJ 784, L43.

Calinski, T., Harabasz, J., 1974. A dendrite method for cluster
analysis. Communications in Statistics 3 (1), 1-27.
URL http://www.tandfonline.com/doi/abs/10.1080/03610927408827101

Carilli, C. L., Aug. 2014. Square Kilometre Array key science: 
a progressive retrospective. ArXiv e-prints. 

Chakraborty, A., Feigelson, E. D., Babu, G. J., Mar. 2013.
VOStat: A Statistical Web Service for Astronomers. PASP 125,
295-305.

Chen, Y.-M., Wild, V., Kauffmann, G., Blaizot, J., Davis, M.,
Noeske, K., Wang, J.-M., Willmer, C., Feb. 2009.
Constraints on the star formation histories of galaxies from
z ~ 1 to 0. MNRAS 393, 406-418.

Cohen, I. B., Mar. 1984. Florence Nightingale. Scientific
American 250 (3), 128-137.
URL http://www.nature.com/scientificamerican/journal/v250/n3/pdf/scientificamerican0384-128.pdf

Conselice, C. J., Dec. 2006. The fundamental properties
of galaxies and a new galaxy classification system.
MNRAS 373, 1389-1408.

Croux, C., Filzmoser, P., Oliveira, M., Mar. 2007. Algorithms
for Projection-Pursuit robust principal component analysis.
Chemometrics and Intelligent Laboratory Systems 87 (2),
218.

de Souza, R. S., Cameron, E., Killedar, M., Hilbe, J., Vilalta,
R., Maio, U., Biffi, V., Ciardi, B., Riggs, J. D., Sep. 2015.
The Overlooked Potential of Generalized Linear Models in
Astronomy - I: Binomial Regression and Numerical
Simulations. arXiv:1409.7696.

de Souza, R. S., Ciardi, B., Maio, U., Ferrara, A., Jan. 2013a.
Dark matter halo environment for primordial star formation.
MNRAS 428, 2109-2117.

de Souza, R. S., Ishida, E. E. O., Johnson, J. L., Whalen, D. J., 
Mesinger, A., Dec. 2013b. Detectability of the first cosmic 
explosions. MNRAS 436, 1555-1563. 

de Souza, R. S., Ishida, E. E. O., Whalen, D. J., Johnson, 
J. L., Ferrara, A., Aug. 2014b. Probing the stellar initial mass 
function with high-z supernovae. MNRAS 442, 1640-1655. 

de Souza, R. S., Maio, U., Biffi, V., Ciardi, B., May 2014.
Robust PCA and MIC statistics of baryons in early minihaloes.
MNRAS 440, 240-248.

de Souza, R. S., Maio, U., Biffi, V., Ciardi, B., May 2014a.
Robust PCA and MIC statistics of baryons in early minihaloes.
MNRAS 440, 240-248.

Donalek, C., Djorgovski, S., Mahabal, A., Graham, M., Drake,
A., Fuchs, T., Turmon, M., Arun Kumar, A., Philip, N.,
Yang, M.-C., Longo, G., Oct 2013. Feature selection
strategies for classifying high dimensional astronomical data
sets. In: Big Data, 2013 IEEE International Conference on.
pp. 35-41.

Draper, G., Livnat, Y., Riesenfeld, R., Sept 2009. A survey of
radial methods for information visualization. Visualization
and Computer Graphics, IEEE Transactions on 15 (5), 759-
776.

Eades, P., 1984. A heuristic for graph drawing. Congressus
Numerantium 42, 149-160.

Epskamp, S., Cramer, A. O., Waldorp, L. J., Schmittmann,
V. D., Borsboom, D., May 2012. qgraph: Network
visualizations of relationships in psychometric data. Journal
of Statistical Software 48 (4), 1-18.
URL http://www.jstatsoft.org/v48/i04

Faber, S. M., Jackson, R. E., Mar. 1976. Velocity dispersions
and mass-to-light ratios for elliptical galaxies. ApJ 204, 668-
683.

Fisher, R. A., 1936. The use of multiple measurements in
taxonomic problems. Annals of Eugenics 7 (2), 179-188.
URL http://dx.doi.org/10.1111/j.1469-1809.1936.tb02137.x

Fornito, A., Zalesky, A., Breakspear, M., Feb. 2015. The
connectomics of brain disorders. Nature Reviews Neuroscience
16 (3), 159-172.
URL http://dx.doi.org/10.1038/nrn3901

Fraix-Burnet, D., Chattopadhyay, T., Chattopadhyay, A. K.,
Davoust, E., Thuillard, M., Sep. 2012. A six-parameter space
to describe galaxy diversification. A&A 545, A80.

Fruchterman, T. M. J., Reingold, E. M., 1991. Graph drawing
by force-directed placement. Softw., Pract. Exper. 21 (11),
1129-1164.
URL http://dblp.uni-trier.de/db/journals/spe/spe21.html#FruchtermanR91

Gilbank, D. G., Baldry, I. K., Balogh, M. L., Glazebrook, K.,
Bower, R. G., Jul. 2010. The local star formation rate
density: assessing calibrations using [OII], Hα and UV
luminosities. MNRAS 405, 2594-2614.

Graham, M. J., Djorgovski, S. G., Mahabal, A. A., Donalek, 
C., Drake, A. J., May 2013. Machine-assisted discovery of 
relationships in astronomy. MNRAS 431, 2371-2384. 

Gu, Z., Gu, L., Eils, R., Schlesner, M., Brors, B., 2014.
circlize implements and enhances circular visualization in R.
Bioinformatics.
URL http://bioinformatics.oxfordjournals.org/content/early/2014/06/14/bioinformatics.btu393.abstract

Guo, Q., White, S., Boylan-Kolchin, M., De Lucia, G.,
Kauffmann, G., Lemson, G., Li, C., Springel, V., Weinmann, S.,
May 2011. From dwarf spheroidals to cD galaxies:
simulating the galaxy population in a ΛCDM cosmology.
MNRAS 413, 101-131.

Hamaus, N., Wandelt, B. D., Sutter, P. M., Lavaux, G.,
Warren, M. S., Jan. 2014. Cosmology with Void-Galaxy
Correlations. Physical Review Letters 112 (4), 041304.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., Stahel,
W. A., 2005. Front Matter. John Wiley & Sons, Inc.
URL http://dx.doi.org/10.1002/9781118186435.fmatter

Hoaglin, D. C., Mosteller, F., Tukey, J. W. (Eds.), 2000.
Understanding Robust and Exploratory Data Analysis, 1st Edition.
Wiley-Interscience.

Houlahan, P., Scalo, J., Jul. 1992. Recognition and
characterization of hierarchical interstellar structure. II -
Structure tree statistics. ApJ 393, 172-187.

Howell, D. C., 2005. Median Absolute Deviation. John Wiley
& Sons, Ltd.
URL http://dx.doi.org/10.1002/0470013192.bsa384

Ishida, E. E. O., de Souza, R. S., Mar. 2011. Hubble parameter
reconstruction from a principal component analysis:
minimizing the bias. A&A 527, A49.

Ishida, E. E. O., de Souza, R. S., Mar. 2013. Kernel PCA for
Type Ia supernovae photometric classification. MNRAS 430,
509-532.

Ishida, E. E. O., de Souza, R. S., Ferrara, A., Nov. 2011.
Probing cosmic star formation up to z = 9.4 with gamma-ray
bursts. MNRAS 418, 500-504.

Kembhavi, A. K., Mahabal, A. A., Kale, T., Jagade, S., Vibhute,
A., Garg, P., Vaghmare, K., Navelkar, S., Agrawal, T.,
Nandrekar, D., Shaikh, M., Mar. 2015. AstroStat - A VO Tool
for Statistical Analysis. arXiv:1503.02989.

Konstantopoulos, I. S., Apr. 2015. The starfish diagram:
Visualising data within the context of survey samples.
Astronomy and Computing 10, 116-120.

Lee, B., Giavalisco, M., Williams, C. C., Guo, Y., Lotz, J., Van
der Wel, A., Ferguson, H. C., Faber, S. M., Koekemoer, A.,
Grogin, N., Kocevski, D., Conselice, C. J., Wuyts, S., Dekel,
A., Kartaltepe, J., Bell, E. F., Sep. 2013. CANDELS: The
Correlation between Galaxy Morphology and Star
Formation Activity at z ~ 2. ApJ 774, 47.

Li, W., 1990. Mutual information functions versus correlation 
functions. Journal of Statistical Physics 60 (5-6), 823-837. 
URL http://dx.doi.org/10.1007/BF01025996 

LSST Science Collaboration, Abell, P. A., Allison, J.,
Anderson, S. F., Andrew, J. R., Angel, J. R. P., Armus, L., Arnett,
D., Asztalos, S. J., Axelrod, T. S., et al., Dec. 2009. LSST
Science Book, Version 2.0. arXiv:0912.0201.

Martinez-Gomez, E., Richards, M. T., Richards, D. S. P., Aug.
2013. Distance Correlation Methods for Discovering
Associations in Large Astrophysical Databases. arXiv:1308.3925.

McDonald, L., 2001. Florence Nightingale and the early origins
of evidence-based nursing. Evidence Based Nursing 4 (3),
68-69.
URL http://ebn.bmj.com/content/4/3/68.short

McGurk, R. C., Kimball, A. E., Ivezic, Z., Mar. 2010.
Principal Component Analysis of Sloan Digital Sky Survey Stellar
Spectra. AJ 139, 1261-1268.

Overzier, R., Lemson, G., Angulo, R. E., Bertin, E., Blaizot, J.,
Henriques, B. M. B., Marleau, G.-D., White, S. D. M., Jan.
2013. The Millennium Run Observatory: first light.
MNRAS 428, 778-803.

Paradis, E., Claude, J., Strimmer, K., 2004. APE: Analyses of
Phylogenetics and Evolution in R language. Bioinformatics
20 (2), 289-290.
URL http://bioinformatics.oxfordjournals.org/content/20/2/289

Patty, J. W., Penn, E. M., Jan. 2015. Analyzing big data: Social
choice and measurement. PS: Political Science & Politics
48, 95-101.
URL http://journals.cambridge.org/article_S1049096514001814

Pearson, K., 1895. Note on regression and inheritance in the
case of two parents. Proceedings of the Royal Society of
London 58 (347-352), 240-242.
URL http://rspl.royalsocietypublishing.org/content/58/347-352/240.short

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R.,
McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher,
M., Sabeti, P. C., Dec. 2011. Detecting Novel Associations
in Large Data Sets. Science 334, 1518-.

Rosolowsky, E. W., Pineda, J. E., Kauffmann, J., Goodman,
A. A., Jun. 2008. Structural Analysis of Molecular Clouds:
Dendrograms. ApJ 679, 1338-1351.

Sako, M., Bassett, B., Becker, A. C., Brown, P. J., Campbell,
H., Cane, R., Cinabro, D., D’Andrea, C. B., et al., Jan. 2014.
The Data Release of the Sloan Digital Sky Survey-II
Supernova Survey. arXiv:1401.3317.

Scaramella, R., Mellier, Y., Amiaux, J., Burigana, C.,
Carvalho, C. S., Cuillandre, J. C., da Silva, A., Dinis, J.,
Derosa, A., Maiorano, E., Franzetti, P., Garilli, B., Maris,
M., Meneghetti, M., Tereno, I., Wachter, S., Amendola, L.,
Cropper, M., Cardone, V., Massey, R., Niemi, S., Hoekstra,
H., Kitching, T., Miller, L., Schrabback, T., Semboloni, E.,
Taylor, A., Viola, M., Maciaszek, T., Ealet, A., Guzzo, L.,
Jahnke, K., Percival, W., Pasian, F., Sauvage, M., the Euclid
Collaboration, Jan. 2015. Euclid space mission: a
cosmological challenge for the next 15 years. ArXiv e-prints.

Scarlata, C., Carollo, C. M., Lilly, S., Sargent, M. T.,
Feldmann, R., Kampczyk, P., Porciani, C., Koekemoer, A., et al.,
Sep. 2007. COSMOS Morphological Classification with the
Zurich Estimator of Structural Types (ZEST) and the
Evolution Since z = 1 of the Luminosity Function of Early, Disk,
and Irregular Galaxies. ApJS 172, 406-433.

Schuh, M. A., Banda, J. M., Wylie, T., McInerney, P., Pillai,
K. G., Angryk, R. A., Apr. 2015. On visualization techniques
for solar data mining. Astronomy and Computing 10, 32-42.

Spearman, C., 1904. The proof and measurement of association
between two things. The American Journal of Psychology
15 (1), 72-101.
URL http://www.jstor.org/stable/1412159

Springel, V., Dec. 2005. The cosmological simulation code 
GADGET-2. MNRAS 364, 1105-1134. 

Tamassia, R., 2007. Handbook of Graph Drawing and
Visualization (Discrete Mathematics and Its Applications).
Chapman & Hall/CRC.

Tully, R. B., Fisher, J. R., Feb. 1977. A new method of
determining distances to galaxies. A&A 54, 661-673.

van Zyl, T., 2014. Machine Learning on Geospatial
Big Data. CRC Press, pp. 133-148.
URL http://dx.doi.org/10.1201/b16524-8

Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L.,
Rusch, D., Eisen, J. A., Wu, D., Paulsen, I., Nelson, K. E.,
Nelson, W., Fouts, D. E., Levy, S., Knap, A. H., Lomas,
M. W., Nealson, K., White, O., Peterson, J., Hoffman, J.,
Parsons, R., Baden-Tillson, H., Pfannkoch, C., Rogers,
Y.-H., Smith, H. O., 2004. Environmental genome shotgun
sequencing of the Sargasso Sea. Science 304 (5667), 66-74.
URL http://www.sciencemag.org/content/304/5667/66.abstract

Vogelsberger, M., Genel, S., Springel, V., Torrey, P., Sijacki, D.,
Xu, D., Snyder, G., Nelson, D., Hernquist, L., Oct. 2014.
Introducing the Illustris Project: simulating the coevolution of
dark and visible matter in the Universe. MNRAS 444, 1518-
1547.

Wilkinson, L., Friendly, M., 2009. The history of the cluster
heat map. The American Statistician 63 (2), 179-184.
URL http://dx.doi.org/10.1198/tas.2009.0033

Yates, R. M., Kauffmann, G., Guo, Q., May 2012. The
relation between metallicity, stellar mass and star formation in
galaxies: an analysis of observational and model data.
MNRAS 422, 215-231.

Appendix A. Running AMADA locally 

Appendix A.1. From Shiny

To install and run the interface, the first step is to
have R installed on your computer. Thereafter, you have to
install the following R packages:

install.packages(c("ape", "circlize", "corrplot", "devtools",
    "fpc", "ggplot2", "ggthemes", "MASS", "markdown", "mclust",
    "minerva", "mvtnorm", "pcaPP", "pheatmap", "phytools",
    "qgraph", "RColorBrewer", "RCurl", "squash", "stats",
    "shiny"), dependencies = TRUE)
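
Before installing, it can be convenient to check which of the packages listed above are not yet present, so that only those need to be downloaded. The following is a minimal sketch and not part of AMADA: the helper `missing_packages` and the variable `amada_deps` are hypothetical names introduced here for illustration, and the installation call is left commented out.

```r
# Package list copied from the install.packages() call above.
amada_deps <- c("ape", "circlize", "corrplot", "devtools", "fpc",
                "ggplot2", "ggthemes", "MASS", "markdown", "mclust",
                "minerva", "mvtnorm", "pcaPP", "pheatmap", "phytools",
                "qgraph", "RColorBrewer", "RCurl", "squash", "stats",
                "shiny")

# Hypothetical helper: return the requested packages that are not installed.
# 'installed' defaults to the names reported by installed.packages().
missing_packages <- function(pkgs,
                             installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(amada_deps)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to actually install them:
  # install.packages(to_install, dependencies = TRUE)
}
```

Keeping the check separate from the installation makes it easy to see what would be installed before anything is downloaded.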

We are now ready to install AMADA from the GitHub
repository:

require(devtools)
install_github("RafaelSdeSouza/AMADA")

An alternative, simpler option is to type the following
command

require(devtools)
install_github("COINtoolbox/AMADA", dependencies = TRUE)

http://www.r-project.org

and R will automatically install the necessary
dependencies to run AMADA. After installing the AMADA
package, the user can run the visual interface with the
following command:


require(shiny)
runUrl("https://github.com/COINtoolbox/AMADA_shiny/archive/master.zip")

AMADA can also be used directly via the web.
This option requires no local installation, but the
actual processing may be slower. This web interface
is hosted by the shinyapps.io platform and can be
accessed directly at http://goo.gl/UTnU7I.

Appendix A.2. From R command line

If the user prefers to run AMADA on their own data
without relying on the Shiny interface, this can be done
directly from the R command line. An example of how
to produce a dendrogram of the Type Ia supernova
dataset and save it as a PDF file is presented below:


require(AMADA)   # Load the package
data("SNIa")     # Load the SNIa data

corr <- Corr_MIC(SNIa, "pearson")
Fig1 <- plotdendrogram(corr, "phylogram")

To save the figure as a PDF file with a customized
height and width, just type the following:


pdf("phylogram.pdf", height = 8, width = 8)
Fig1
dev.off()
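
For readers who want to see the kind of computation involved without installing AMADA, the base-R sketch below follows the same steps: a correlation matrix, a hierarchical clustering of the variables, and a dendrogram saved as a PDF. It is only an illustration under stated assumptions. It uses the built-in iris data instead of the SNIa dataset, takes 1 - |r| as the dissimilarity between variables, and uses average linkage. None of these choices is claimed to match what Corr_MIC or plotdendrogram do internally.

```r
# Base-R stand-in for the workflow above; this is not AMADA code.
data(iris)                               # built-in example dataset
vars <- iris[, 1:4]                      # keep the four numeric columns

# Pearson correlation matrix between the variables (4 x 4).
corr <- cor(vars, method = "pearson")

# Assumed dissimilarity: strongly (anti-)correlated variables are close.
dissim <- as.dist(1 - abs(corr))

# Hierarchical clustering of the variables (average linkage is an assumption).
hc <- hclust(dissim, method = "average")

# Save the dendrogram as a PDF with customized height and width.
out_file <- file.path(tempdir(), "dendrogram_sketch.pdf")
pdf(out_file, height = 8, width = 8)
plot(hc, main = "Variable dendrogram (sketch)", xlab = "", sub = "")
dev.off()
```

Replacing `method = "pearson"` with `"spearman"` in the call to `cor` gives a rank-based (monotonic) variant of the same sketch.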

Examples of how to use the other functions inside
R can be found in the description file, which can be
accessed via the command


help(package = "AMADA")

In the current package version, the layout of the
figures is mostly hardcoded, but it can be easily changed
inside the source code. We expect to add more
flexibility in future versions.


http://www.shinyapps.io

We should stress that the functions to display the chord
diagram and the heatmap are basically convenient wrappers to the
functions available in the packages PHEATMAP and CIRCLIZE.




